I am trying to finish the discussion of the phyloglmm paper and focus on writing about why we (ecologists) care about this, or would want to do it. The terms “phylogenetic variation” and “phylogenetic signal” have always been confusing to me. The definition that is often used is “Phylogenetic signal is a measure of the statistical dependence among species’ trait values due to their phylogenetic relationships”. After hearing Dr. Susan Holmes’ talk, I really like how she phrased the question: “Are the data tree-like?” Instead of looking at the data and asking whether they are tree-like, or what kind of tree we can build from the observed trait values, abundances, etc., I am interested in the reverse question (for the sake of finishing phyloglmm): given a known phylogenetic tree, what kind of data can we generate, and is it obvious that they look “tree-like”?

I still find the definition vague and I often take it as: “species that are closely related have similar trait values”, which is both true and not true to some degree.

Example

Let’s assume the evolutionary process is Brownian motion, which means that a trait evolves independently along each branch of the phylogeny, following a standard Brownian motion. The phylogenetic variability of a particular species can then be written as the sum of the variances of the evolutionary changes that occurred on all of the branches in its history.
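A minimal sketch of that bookkeeping, in Python rather than whatever produced the figures here. Under standard Brownian motion, the variance of a tip is its total root-to-tip branch length, and the covariance between two tips is the length of the branches they share. The tree and its branch lengths below are hypothetical (made up to mimic an "extreme" tree with two long, mostly shared lineages and one short one), and `path_to_root` and `bm_covariance` are illustrative names, not functions from phyloglmm.

```python
# Hypothetical tree; branch lengths are invented for illustration only:
#   root --9--> A --1--> t1
#               A --1--> t2
#   root --1--> t3
# Each entry maps a node to (its parent, length of the branch leading to it).
parent = {
    "A": ("root", 9.0),
    "t1": ("A", 1.0),
    "t2": ("A", 1.0),
    "t3": ("root", 1.0),
}


def path_to_root(node, parent):
    """Return the branches between node and the root as (child_node, length) pairs."""
    edges = []
    while node in parent:
        up, length = parent[node]
        edges.append((node, length))
        node = up
    return edges


def bm_covariance(tips, parent):
    """Covariance under unit-rate Brownian motion: total length of shared branches."""
    paths = {tip: dict(path_to_root(tip, parent)) for tip in tips}
    cov = []
    for a in tips:
        row = []
        for b in tips:
            # A branch is shared if it appears on both root-to-tip paths.
            shared = sum(length for edge, length in paths[a].items() if edge in paths[b])
            row.append(shared)
        cov.append(row)
    return cov


tips = ["t1", "t2", "t3"]
V = bm_covariance(tips, parent)
for name, row in zip(tips, V):
    print(name, row)
```

The diagonal entries are the per-species variances described above; the off-diagonal entries are what make the data "tree-like" or not.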

Consider this phylogenetic tree:

“Tree-like” data from this extreme example should be “very” easy to identify. We expect t1 and t2 to look similar (as stated above) and t3 to look different than t1 and t2.

Simulating one set of data that corresponds to this tree (this is what people usually have: a set of measurements for a bunch of species) looks like this. For simplicity, none of the examples here include observation error (we do have observation error in the phyloglmm paper).
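One way to draw such a set of tip values, sketched with only the Python standard library: factor the phylogenetic covariance matrix with a Cholesky decomposition and multiply the factor by independent standard normals. The covariance matrix `V` is a hypothetical stand-in for the tree above (the real branch lengths are not given in this post), and the helper names are illustrative.

```python
import math
import random


def cholesky(V):
    """Lower-triangular L with L * L^T = V, for a positive-definite matrix V."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(V[i][i] - s)
            else:
                L[i][j] = (V[i][j] - s) / L[j][j]
    return L


def simulate_traits(V, rng):
    """Draw one zero-mean multivariate normal vector with covariance V."""
    L = cholesky(V)
    z = [rng.gauss(0.0, 1.0) for _ in V]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(V))]


# Hypothetical covariance for tips t1, t2, t3 (assumed values, no observation error).
V = [[10.0, 9.0, 0.0],
     [9.0, 10.0, 0.0],
     [0.0, 0.0, 1.0]]

rng = random.Random(1)  # fixed seed so the draw is reproducible
one_set = simulate_traits(V, rng)
print(one_set)
```

A single draw like this is one "line" across the tips, and there is no guarantee that any one line resembles the tree.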

Ok… maybe there is a bug in the code and MLi doesn’t know what he is doing (very common), because it doesn’t look like the tree. Now let’s look at 100 sets of observations from the same tree. Note, each “line” represents a trait that is observed across species/tips.

Ok, that looks a lot better! The lines between t1 and t2 are more “horizontal” because we expect high correlation between them. The lines connected to t3 are all over the place because t3 has zero correlation with t1 and t2. Going back to the definition, phylogenetic variability is the “potential” to look different from the ancestral node. t3 has a very short branch, so it doesn’t have much potential to look very different, whereas t1 and t2 have long branches and can potentially look very different.
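Both ideas in this paragraph can be read directly off a covariance matrix: the "potential" of each tip is its standard deviation, and the "horizontal-ness" between two tips is their correlation. A small sketch, again using an assumed covariance matrix rather than the actual tree from the figure:

```python
import math

# Hypothetical covariance for tips t1, t2, t3 (assumed values for illustration).
tips = ["t1", "t2", "t3"]
V = [[10.0, 9.0, 0.0],
     [9.0, 10.0, 0.0],
     [0.0, 0.0, 1.0]]

# "Potential" to differ from the ancestral node: standard deviation per tip.
sd = [math.sqrt(V[i][i]) for i in range(len(V))]

# Correlation between tips: covariance scaled by both standard deviations.
corr = [[V[i][j] / (sd[i] * sd[j]) for j in range(len(V))] for i in range(len(V))]

for name, s, row in zip(tips, sd, corr):
    print(name, "sd =", round(s, 3), "corr =", [round(c, 3) for c in row])
```

With these assumed numbers, t1 and t2 have a large spread but move together, while t3 has a small spread and is uncorrelated with both.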

Drawing the potential sampling distributions side by side for each species looks like this:

Star phylogeny

Repeating the same thing for a star phylogeny:

Random tree

Repeating the same thing for a random tree.

Thoughts and tons of questions

Going back to the question: “Are the data tree-like?” (I don’t know) or “Do the data look like my tree?” (I don’t know). As the examples above show, we can generate data from the tree that are not tree-like. The danger is that these non-tree-like simulations look like noise. Given that we did not even add observation error, I think it is almost impossible to disentangle “phylogenetic noise” from observation error when observation error is present.

So what proportion of measurements simulated from the “null/true tree” are “tree-like”? Another way to think about it is what proportion are “not tree-like”? A first guess is that the non-tree-like ones sit near the center. For example:

This gray ribbon is two standard deviations from the center, using the minimum variance from the phylogenetic variance matrix. This looks good at first, but it is also the region with the highest probability density under MVN(0, phylo_var).

Likelihood surface approach

Note: I am certain I am reinventing the wheel, but let’s just see how far I can go starting from the basics.

Using the extreme tree presented above, given a set of observed traits (triplets, simulated from the tree), we can compute the likelihood of the traits given the variance matrix of the phylogenetic tree. We need to decide on a null model/tree to compare it with. The most logical null tree is the one that preserves the branch lengths (i.e. the diagonals remain the same) and assumes there isn’t any shared information (i.e. the off-diagonals are all zero). The null tree looks like this:

The way to think about “phylogenetic signal” is as the amount of correlation we see between each pair of species (distance?). The distribution plot above just shows how much each species can change relative to the ancestral node. Thus, the null tree will have the same species-level trait distributions, but what we are interested in is the correlation (crosses).

Given a set of traits, we can now calculate the likelihood under the known/true tree (with the covariance) and the null tree (without the covariance). We can do a likelihood ratio test and compute the Chisq statistic. TODO: What is the “right” statistic? I am using Chisq for the easy examples and using Z when I am lazy.

## [1] 0.2806613

Approximately 28% of the traits are “tree-like” by LRT.
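The comparison for a single set of traits can be sketched as follows. This is not the code behind the number above. It assumes a hypothetical covariance matrix, builds the null by zeroing the off-diagonals, and computes twice the log-likelihood difference. The cutoff of 3.841 is the 95% quantile of a chi-square with 1 degree of freedom. Treating the statistic as chi-square with 1 df is an assumption here (both models are fully specified, so the "right" reference distribution is exactly the open TODO).

```python
import math


def cholesky(V):
    """Lower-triangular L with L * L^T = V."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(V[i][i] - s) if i == j else (V[i][j] - s) / L[j][j]
    return L


def mvn_loglik(x, V):
    """Log density at x of a zero-mean multivariate normal with covariance V."""
    n = len(x)
    L = cholesky(V)
    y = []  # forward substitution: solve L y = x, so that y.y = x' V^-1 x
    for i in range(n):
        y.append((x[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + sum(v * v for v in y))


def null_covariance(V):
    """Keep the diagonal (per-species variances), drop all covariances."""
    n = len(V)
    return [[V[i][j] if i == j else 0.0 for j in range(n)] for i in range(n)]


def lrt_statistic(x, V):
    """Twice the log-likelihood ratio of the tree model over the null model."""
    return 2.0 * (mvn_loglik(x, V) - mvn_loglik(x, null_covariance(V)))


# Hypothetical covariance for tips t1, t2, t3 (assumed values).
V = [[10.0, 9.0, 0.0],
     [9.0, 10.0, 0.0],
     [0.0, 0.0, 1.0]]
CHISQ_1DF_95 = 3.841  # assumed reference cutoff; the df choice is not from the post

for x in ([6.0, 6.0, 0.0], [3.0, -3.0, 0.0]):
    stat = lrt_statistic(x, V)
    print(x, "statistic:", round(stat, 3), "tree-like by this cutoff:", stat > CHISQ_1DF_95)
```

The first example has t1 and t2 far from zero in the same direction, which favors the tree; the second has them on opposite sides, which favors the null.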

More Data!

Statistics 101, can we see it more clearly if we have more data? There are two ways to increase data:

Increasing the number of species

Before I simulate a larger tree, I just want to test the less extreme trees.

## [1] 0.384

Note: Maybe the extreme examples are not good? We still see a large area in the likelihood surface that isn’t tree-like.

Brute-forcing the grid is not a good idea and becomes very inefficient as the dimension increases. The grid is not a good representation of the sampling distribution anyway, so the proportion of tree-like points computed on the grid decreases as we increase the number of species.

Can we get away with just sampling traits from the tree?

Here I sample 10,000 points from the n-dimensional space using an MVN with the phylogenetic variance. It is surprising how fast it levels off.
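A sketch of the sampling idea, assuming the same hypothetical 3-tip covariance and an assumed chi-square cutoff (1 df, 95%): draw trait vectors from the tree's MVN, apply the LRT to each, and track the running proportion flagged as tree-like. The checkpoints only illustrate how the estimate settles with more draws; they are not the quantity plotted in this post.

```python
import math
import random


def cholesky(V):
    """Lower-triangular L with L * L^T = V."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(V[i][i] - s) if i == j else (V[i][j] - s) / L[j][j]
    return L


def mvn_loglik(x, V):
    """Log density at x of a zero-mean multivariate normal with covariance V."""
    n = len(x)
    L = cholesky(V)
    y = []
    for i in range(n):
        y.append((x[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + sum(v * v for v in y))


def lrt_statistic(x, V):
    """Twice the log-likelihood ratio: tree covariance vs. diagonal-only null."""
    n = len(V)
    null = [[V[i][j] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return 2.0 * (mvn_loglik(x, V) - mvn_loglik(x, null))


def running_proportion(V, n_draws, threshold, rng, checkpoints):
    """Proportion of simulated trait sets with statistic above threshold, at checkpoints."""
    L = cholesky(V)
    hits = 0
    out = {}
    for draw in range(1, n_draws + 1):
        z = [rng.gauss(0.0, 1.0) for _ in V]
        x = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(V))]
        if lrt_statistic(x, V) > threshold:
            hits += 1
        if draw in checkpoints:
            out[draw] = hits / draw
    return out


# Hypothetical covariance for tips t1, t2, t3 (assumed values).
V = [[10.0, 9.0, 0.0],
     [9.0, 10.0, 0.0],
     [0.0, 0.0, 1.0]]
rng = random.Random(2024)  # fixed seed for reproducibility
props = running_proportion(V, 10000, 3.841, rng, {100, 1000, 10000})
for n in sorted(props):
    print(n, props[n])
```

Sampling scales with the number of draws rather than with the size of a grid, so the same function works unchanged for larger covariance matrices.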

Increasing the number of traits

The other way to increase data is to collect different types of traits.

Here, I simulate 1000 sets of traits (lines) and do the LRT for each line. The first plot is consistent with the earlier result: we see a small proportion of tree-like data. To “increase” the data, I did the likelihood ratio test given that we saw multiple sets of traits. We can easily do this by adding up the log-likelihoods of each set of traits under the MVN for the true tree and for the null tree, and then doing the corresponding LRT. Note: Order does matter! We need a random order!

We can see that if we combine multiple traits, it increases our chance to detect if the data are tree-like via LRT.
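Because independent traits contribute additive log-likelihoods, the combined statistic is just a running sum of the per-trait statistics. A minimal sketch under assumptions: a hypothetical 3-tip covariance, 50 simulated traits (instead of 1000, to keep it short), a shuffled order, and the same assumed cutoff of 3.841 applied to the cumulative statistic.

```python
import math
import random


def cholesky(V):
    """Lower-triangular L with L * L^T = V."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(V[i][i] - s) if i == j else (V[i][j] - s) / L[j][j]
    return L


def mvn_loglik(x, V):
    """Log density at x of a zero-mean multivariate normal with covariance V."""
    n = len(x)
    L = cholesky(V)
    y = []
    for i in range(n):
        y.append((x[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + sum(v * v for v in y))


def lrt_statistic(x, V):
    """Twice the log-likelihood ratio: tree covariance vs. diagonal-only null."""
    n = len(V)
    null = [[V[i][j] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return 2.0 * (mvn_loglik(x, V) - mvn_loglik(x, null))


def cumulative_lrt(trait_sets, V):
    """Combined statistic after 1, 2, ..., k traits (log-likelihoods add up)."""
    total = 0.0
    out = []
    for x in trait_sets:
        total += lrt_statistic(x, V)
        out.append(total)
    return out


# Hypothetical covariance for tips t1, t2, t3 (assumed values).
V = [[10.0, 9.0, 0.0],
     [9.0, 10.0, 0.0],
     [0.0, 0.0, 1.0]]
rng = random.Random(7)
L = cholesky(V)
trait_sets = []
for _ in range(50):
    z = [rng.gauss(0.0, 1.0) for _ in V]
    trait_sets.append([sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(V))])
rng.shuffle(trait_sets)  # random order, since the cumulative path depends on it

cumulative = cumulative_lrt(trait_sets, V)
first_over = next((k + 1 for k, s in enumerate(cumulative) if s > 3.841), None)
print("traits needed to cross the assumed cutoff:", first_over)
print("statistic after all traits:", round(cumulative[-1], 3))
```

The final total does not depend on the order, but the number of traits needed to cross a cutoff does, which is why the order is shuffled.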

Just an extreme example: what if we combine a bunch of non-tree-like traits? By ordering the